Automating the extraction of data from HTML tables with unknown structure
Abstract
Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. Our solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to find tables of interest within a Web page, recognize attributes and values within the table, pair attributes with values, and form records. Data-integration techniques allow us to match source records with a target schema. Ontologically specified wrappers allow us to extract data from source records into a target schema. Experimental results show that we can successfully locate data of interest in tables and map the data from source HTML tables with unknown structure to a given target database schema. We can thus “directly” query source data with unknown structure through a known target schema.
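The pipeline described in the abstract (recognize attributes and values, pair them into records, then match the records to a target schema) can be illustrated with a minimal sketch. This is not the authors' system: the `TARGET_SCHEMA` synonym lists are a hypothetical stand-in for an extraction ontology, the sketch assumes the first table row holds the attribute names, and the names `TableRows`, `match_attribute`, and `extract_records` are invented for illustration. The car-advertisement data is made up.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each cell, row by row, from every <tr> in the input."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical synonym lists standing in for an extraction ontology.
TARGET_SCHEMA = {
    "make": {"make", "manufacturer", "brand"},
    "year": {"year", "yr", "model year"},
    "price": {"price", "cost", "asking price"},
}


def match_attribute(label):
    """Return the target-schema field for a source column label, or None."""
    key = label.strip().lower()
    for target, synonyms in TARGET_SCHEMA.items():
        if key in synonyms:
            return target
    return None


def extract_records(html):
    """Pair header attributes with cell values and keep only target fields."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows  # assumption: first row names the attributes
    targets = [match_attribute(h) for h in header]
    records = []
    for row in body:
        record = {t: v for t, v in zip(targets, row) if t is not None}
        if record:
            records.append(record)
    return records


SAMPLE = """
<table>
  <tr><th>Brand</th><th>Yr</th><th>Colour</th><th>Asking Price</th></tr>
  <tr><td>Honda</td><td>1999</td><td>red</td><td>$4,500</td></tr>
  <tr><td>Ford</td><td>2001</td><td>blue</td><td>$6,200</td></tr>
</table>
"""

records = extract_records(SAMPLE)
```

Columns with no match in the target schema (here `Colour`) are dropped, so the resulting records can be queried through the known target fields regardless of how the source table labeled them. A real system would also have to handle tables without header rows, transposed layouts, and nested tables, none of which this sketch attempts.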
Related resources
Automating the Extraction of Data from HTML Tables with Unknown Structure
The authors propose a solution to the problem of web information extraction, which aims to extract relevant information from webpages. However, since this is a broad field, they have limited their work to information that is available in HTML tables found on the Web and that relates to a specific domain of interest. As a running example in their paper, the authors use car advertisements. I suggest ...
Automatic Ontology-Based Knowledge Extraction from Web Documents vs. Automating the Extraction of Data from HTML Tables with Unknown Structure
In this report we compare the papers [AKM+03] and [ETL03]. We show that the two proposed systems realize different goals with the same or similar underlying techniques.
• Source data of interest: [ETL03] takes web pages containing HTML tables of interest for a given application domain as input, whereas [AKM+03] considers unstructured text from webpages for the knowledge extraction process. ...
Automatically Extracting Ontologically Specified Data from HTML Tables of Unknown Structure
Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. The solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to recognize attributes...
Information Extraction from HTML Pages and its Integration
We propose a method of transformation and integration of HTML tables into a common XML list structure. HTML tables tend to have diversified structures, and such integration will help us browse and compare all related information in separate HTML pages simultaneously. This paper focuses on tasks of information extraction from tables and data categorization. For this purpose, we applied three alg...
Mining Tables from Large Scale HTML Texts
Tables are a very common presentation scheme, but few papers address table extraction in text data mining. This paper focuses on mining tables from large-scale HTML texts. Table filtering, recognition, interpretation, and presentation are discussed. Heuristic rules and cell similarities are employed to identify tables. The F-measure of table recognition is 86.50%. We also propose an algorithm to...
Journal:
- Data Knowl. Eng.
Volume 54
Publication date: 2005